Add two regression datasets: California Housing and Diabetes #39
LGTM
Pull request overview
This PR adds two sklearn-sourced regression datasets (California Housing and Diabetes) to the CausalBench test data bundle for demos/testing, along with lightweight download scripts and README documentation updates.
Changes:
- Added `california_housing` dataset config + regeneration script (and accompanying data/zip artifacts).
- Added `diabetes` dataset config + CSV + regeneration script (and accompanying zip artifact).
- Updated README dataset table; minor formatting cleanup in `zip_files.py`.
Reviewed changes
Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| causalbench-asu/tests/zip_files.py | Minor formatting / quoting updates for zip utility. |
| causalbench-asu/tests/data/diabetes/download_data.py | Script to regenerate the Diabetes CSV from sklearn. |
| causalbench-asu/tests/data/diabetes/diabetes_data.csv | Added Diabetes dataset CSV. |
| causalbench-asu/tests/data/diabetes/config.yaml | Added dataset config for Diabetes. |
| causalbench-asu/tests/data/diabetes.zip | Added packaged dataset zip. |
| causalbench-asu/tests/data/california_housing/download_data.py | Script to regenerate the California Housing CSV from sklearn. |
| causalbench-asu/tests/data/california_housing/config.yaml | Added dataset config for California Housing. |
| README.md | Updated dataset list table to include the new datasets + minor whitespace cleanup. |
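The table lists each dataset folder alongside a packaged `.zip` artifact. As a rough illustration of that packaging step, here is a minimal sketch using only the standard library. It is not the repository's `zip_files.py`: the function name `zip_dataset`, the archive layout (paths stored relative to the dataset folder), and the placeholder file contents are all assumptions made for the example.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir: Path) -> Path:
    """Package every file under dataset_dir into <name>.zip beside the folder."""
    zip_path = dataset_dir.parent / f"{dataset_dir.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(dataset_dir.rglob("*")):
            if file.is_file():
                # Assumption: entries are stored relative to the dataset folder.
                archive.write(file, file.relative_to(dataset_dir).as_posix())
    return zip_path


# Hypothetical layout that mimics tests/data/diabetes/ in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "diabetes"
    dataset.mkdir()
    (dataset / "config.yaml").write_text("placeholder config\n")
    (dataset / "diabetes_data.csv").write_text("placeholder,csv\n")

    archive_path = zip_dataset(dataset)
    with zipfile.ZipFile(archive_path) as archive:
        packaged = sorted(archive.namelist())
```

Writing the archive next to the folder matches the paths in the table (`tests/data/diabetes/` and `tests/data/diabetes.zip`); whether the real utility stores relative or prefixed entry names is not stated in this PR.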

      type: continuous
      data: decimal
    sex:
      header: sex
      type: continuous
      data: decimal
    bmi:
      header: bmi
      type: continuous
      data: decimal
    bp:
      header: bp
      type: continuous
      data: decimal
    s1:
      header: s1
      type: continuous
      data: decimal
    s2:
      header: s2
      type: continuous
      data: decimal
    s3:
      header: s3
      type: continuous
      data: decimal
    s4:
      header: s4
      type: continuous
      data: decimal
    s5:
      header: s5
      type: continuous
      data: decimal
    s6:
      header: s6
      type: continuous
      data: decimal
    target:
      header: target
      type: continuous

Copilot
AI
Feb 6, 2026
The column type values are set to continuous, but other dataset configs in this repo use values like ratio/nominal (e.g., tests/data/panama/config.yaml). If the dataset loader only recognizes the existing enum values, continuous will fail schema validation or parsing. Please align these column type fields with the accepted values used elsewhere (e.g., use ratio for numeric continuous variables) or update the loader/schema to explicitly support continuous.

      type: ratio
      data: decimal
    sex:
      header: sex
      type: ratio
      data: decimal
    bmi:
      header: bmi
      type: ratio
      data: decimal
    bp:
      header: bp
      type: ratio
      data: decimal
    s1:
      header: s1
      type: ratio
      data: decimal
    s2:
      header: s2
      type: ratio
      data: decimal
    s3:
      header: s3
      type: ratio
      data: decimal
    s4:
      header: s4
      type: ratio
      data: decimal
    s5:
      header: s5
      type: ratio
      data: decimal
    s6:
      header: s6
      type: ratio
      data: decimal
    target:
      header: target
      type: ratio

`type` is not used at the moment; see
CausalBench/causalbench-asu/causalbench/modules/dataset.py
Lines 76 to 112 in a95d5cc

    for column, col_data in data.columns.items():
        if data.headers:
            col_df = file_df[col_data.header]
        else:
            col_df = file_df[col_data._index]
        if col_data.data == 'integer':
            if not pd.api.types.is_integer_dtype(col_df):
                raise TypeError(f'Data type mismatch for column {column}')
            if 'labels' in col_data:
                labels = sorted(col_data.labels)
                data_labels = sorted(file_df[col_data.header].unique())
                if labels != data_labels:
                    raise ValueError(f'Labels do not match for column {column}')
            if 'range' in col_data:
                start = col_data.range.start
                end = col_data.range.end
                min1 = min(file_df[col_data.header])
                max1 = max(file_df[col_data.header])
                if not (start <= min1 <= end and start <= max1 <= end):
                    raise ValueError(f'Range does not match for column {column}')
        elif col_data.data == 'decimal':
            if not pd.api.types.is_float_dtype(col_df):
                raise TypeError(f'Data type mismatch for column {column}')
            if 'labels' in col_data:
                labels = sorted(col_data.labels)
                data_labels = sorted(file_df[col_data.header].unique())
                if labels != data_labels:
                    raise ValueError(f'Labels do not match for column {column}')
            if 'range' in col_data:
                start = col_data.range.start
                end = col_data.range.end
                min1 = min(file_df[col_data.header])
                max1 = max(file_df[col_data.header])
                if not (start <= min1 <= end and start <= max1 <= end):
                    raise ValueError(f'Range does not match for column {column}')

Existing configs either leave it blank or set it to `ratio` or `nominal`.
Based on the schema, it can be defined quite arbitrarily:
CausalBench/causalbench-asu/causalbench/modules/schema/dataset.yaml
Lines 67 to 70 in a95d5cc

    type:
      anyOf:
        - type: string
        - type: 'null'

I changed it to blank.
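To make the point of this thread concrete, here is a small standard-library sketch. It is a simplified stand-in, not the CausalBench loader: `type_matches_schema` mirrors the quoted schema fragment (`type` may be any string or null), and `check_column` mimics the shape of the quoted loader excerpt by branching on `data` only. Both function names, the plain-list input, and the sample values are assumptions made for illustration.

```python
def type_matches_schema(value):
    """Mirror the schema fragment: `type` is any string or null."""
    return value is None or isinstance(value, str)


def check_column(values, spec):
    """Simplified stand-in for the loader check; it never reads spec['type']."""
    expected = {"integer": int, "decimal": float}[spec["data"]]
    for v in values:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(v, bool) or not isinstance(v, expected):
            raise TypeError(f"Data type mismatch for column {spec['header']}")
    if "range" in spec:
        start, end = spec["range"]["start"], spec["range"]["end"]
        if not (start <= min(values) <= end and start <= max(values) <= end):
            raise ValueError(f"Range does not match for column {spec['header']}")
    return True


# Hypothetical sample values, not taken from the actual CSV.
bmi = [0.0617, -0.0515, 0.0444]

# The same column passes whether `type` is "continuous", "ratio", or blank.
results = [
    check_column(bmi, {"header": "bmi", "type": t, "data": "decimal"})
    for t in ("continuous", "ratio", None)
]
schema_ok = [type_matches_schema(t) for t in ("continuous", "ratio", None)]
```

Under these assumptions, `continuous`, `ratio`, and a blank value are interchangeable as far as validation goes, which is consistent with leaving the field blank in the new configs.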
update zip_files update docs
Summary
This PR adds two classic regression datasets from sklearn to CausalBench for demo and testing purposes.
Datasets Added
1. California Housing Dataset
2. Diabetes Dataset
Changes
Dataset Files
Each dataset includes:
- `config.yaml` configuration file following CausalBench schema
- `download_data.py` script to regenerate data from sklearn
Deliverables
- `causalbench-asu/tests/data/`
- `.zip` files for each dataset
Design Decisions
Testing
All datasets successfully load through the CausalBench framework: